Data Set Information:

This is a large dataset of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The collected data covers a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: economy, microsoft, obama and palestine. This data set is tailored for evaluative comparisons in predictive analytics tasks, while also allowing for tasks in other research areas such as topic detection and tracking, sentiment analysis in short text, first story detection or news recommendation.

The features of each instance and their definition are as follows:

Some of the required libraries:

Data & Exploratory Data Analysis

Plots for Better Understanding...

We can see above that the topics have similar characteristics, and no single feature stands out for any of them; it would be more informative to apply the same analysis to the words themselves.

That is why we are going to compute and display a histogram of the number of words in the headlines for each topic:
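A minimal sketch of the word-count computation, using a few hypothetical sample records; in the notebook these come from the headline and topic columns of the news DataFrame (the counts would then be plotted as a histogram per topic):

```python
from collections import defaultdict

# Hypothetical sample records: (headline, topic).
records = [
    ("Obama speaks on the economy", "obama"),
    ("Microsoft releases a new update", "microsoft"),
    ("Markets rally as economy grows", "economy"),
]

# Number of words per headline, grouped by topic.
word_counts = defaultdict(list)
for title, topic in records:
    word_counts[topic].append(len(title.split()))

for topic, counts in sorted(word_counts.items()):
    print(topic, counts)
```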

Cleaning:

We have to consider that the instances in the file were collected from several sources: Facebook, LinkedIn and Google Plus. As we have seen, some of the headlines include non-alphanumeric characters belonging to tags used in social media. Such tags (mentions or hashtags) can be useful in specific tasks, but within the scope of topic classification they are not significantly useful. That is why in the current step we will get rid of them, so as to keep only the core text:
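A minimal cleaning sketch along these lines; `clean_headline` is a hypothetical helper that drops mentions, hashtags and remaining non-alphanumeric characters:

```python
import re

def clean_headline(text: str) -> str:
    """Keep only alphanumeric characters and spaces, lowercase the result."""
    text = re.sub(r"[@#]\w+", " ", text)          # drop mentions and hashtags
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)   # drop remaining punctuation/symbols
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_headline("Obama: #economy update @WSJ, markets up 3%!"))
```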

Models

With BERT we are able to classify the topic with an accuracy of 98 percent.

Prediction for the validation set: let us predict the topics for the validation set and print the first 5 predictions, so we can compare them and compute the error metrics:

As we know, the output is an array of probabilities per instance, one value per class; from each array we take the index of the maximum (argmax) to obtain the predicted class:
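The argmax step can be sketched as follows; the probability values and the alphabetical class order are toy assumptions:

```python
import numpy as np

# Hypothetical softmax outputs for 3 validation headlines.
classes = ["economy", "microsoft", "obama", "palestine"]
probs = np.array([
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.80, 0.10, 0.05],
    [0.20, 0.10, 0.60, 0.10],
])

pred_idx = probs.argmax(axis=1)              # index of the most probable class per row
pred_labels = [classes[i] for i in pred_idx]
print(pred_labels)  # ['economy', 'microsoft', 'obama']
```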

The model above is able to classify the topic with 94% accuracy.

GLOVE-MODELS

Pre-trained model 1: This model keeps the same layers as the from-scratch model, so we can compare the two and quantify the impact of using GloVe weights in the embedding layer:
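A sketch of how the GloVe weights are loaded into an embedding matrix; the toy `glove` dict stands in for the parsed GloVe file, and the `word_index` mapping stands in for the tokenizer's vocabulary (both are assumptions here):

```python
import numpy as np

EMBEDDING_DIM = 4  # real GloVe files use 50/100/200/300 dimensions

# Toy stand-in for the parsed GloVe file: word -> vector.
glove = {
    "economy": np.array([0.1, 0.2, 0.3, 0.4]),
    "obama":   np.array([0.5, 0.1, 0.0, 0.2]),
}

# Toy stand-in for the tokenizer's word_index (1-based; row 0 is padding).
word_index = {"economy": 1, "obama": 2, "unknownword": 3}

# Rows for out-of-vocabulary words (and the padding row) stay zero.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vec = glove.get(word)
    if vec is not None:
        embedding_matrix[i] = vec
```

This matrix would then be passed to the embedding layer via `weights=[embedding_matrix]`, typically with `trainable=False` to keep the pre-trained vectors fixed.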

Pre-trained model 2: This model will contain layers we have not used yet: SpatialDropout1D, which drops entire feature channels of a 1D sequence rather than individual units, then an LSTM with 45 units (equal to the sentence length) and finally a Dense softmax layer:
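A minimal sketch of this architecture, assuming a vocabulary of 10,000 words and 100-dimensional embeddings (both assumptions; the GloVe weights themselves are omitted here):

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN = 45        # sentence length used in the notebook
NUM_CLASSES = 4     # economy, microsoft, obama, palestine

model = models.Sequential([
    layers.Embedding(10000, 100),   # would receive weights=[embedding_matrix]
    layers.SpatialDropout1D(0.2),   # drops whole embedding channels, not single units
    layers.LSTM(45),                # 45 units, matching the sentence length
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

out = model.predict(np.zeros((2, MAX_LEN), dtype="int32"), verbose=0)
print(out.shape)  # (2, 4)
```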

Pre-trained model 3: We will go back to the first architecture and replace the GlobalAveragePooling1D layer with an LSTM of 45 units followed by 2 Dense layers; let us see the performance:
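A sketch of this variant under the same assumptions as above (10,000-word vocabulary, 100-dimensional embeddings); the hidden Dense size of 24 is also an assumption:

```python
import numpy as np
from tensorflow.keras import layers, models

model3 = models.Sequential([
    layers.Embedding(10000, 100),         # would receive the GloVe weights
    layers.LSTM(45),                      # replaces GlobalAveragePooling1D from model 1
    layers.Dense(24, activation="relu"),  # hidden size is an assumption
    layers.Dense(4, activation="softmax"),
])
model3.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

out3 = model3.predict(np.zeros((1, 45), dtype="int32"), verbose=0)
print(out3.shape)  # (1, 4)
```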

Pre-trained models 1, 2 & 3 are able to classify the topic with an average accuracy of 89%.

CONTENT-BASED RECOMMENDATIONS

The main output of interest is the "Content-Based Recommendations" section, which will display the top recommended articles that are similar in content to the input article title.

Please note that the specific recommendations depend on the input article title we provide, and will differ for different titles.
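One common way to build such content-based recommendations is TF-IDF vectors with cosine similarity; a minimal sketch on toy titles (the notebook would use the full news DataFrame, and the exact method used there may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy article titles standing in for the full dataset.
titles = [
    "Obama speaks on the economy",
    "Microsoft releases a new Windows update",
    "Economy grows faster than expected",
    "Peace talks in Palestine resume",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(titles)

def recommend(query_title: str, top_n: int = 2):
    """Return the top_n titles most similar to query_title by cosine similarity."""
    query_vec = vectorizer.transform([query_title])
    scores = cosine_similarity(query_vec, tfidf).ravel()
    best = scores.argsort()[::-1][:top_n]
    return [titles[i] for i in best]

print(recommend("update on the economy"))
```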